Complete chloroplast genome sequences of Myristicaceae species with the comparative chloroplast genomics and phylogenetic relationships among them

Background Myristicaceae was widly distributed from tropical Asia to Oceania, Africa, and tropical America. There are 3 genera and 10 species of Myristicaceae present in China, mainly distributed in the south of Yunnan Province. Most research on this family focuses on fatty acids, medicine, and morphology. Based on the morphology, fatty acid chemotaxonomy, and a few of molecular data, the phylogenetic position of Horsfieldia pandurifolia Hu was controversial. Results In this study, the chloroplast genomes of two Knema species, Knema globularia (Lam.) Warb. and Knema cinerea (Poir.) Warb., were characterized. Comparing the genome structure of these two species with those of other eight published species, including three Horsfieldia species, four Knema species, and one Myristica species, it was found that the chloroplast genomes of these species were relatively conserved, retaining the same gene order. Through sequence divergence analysis, there were 11 genes and 18 intergenic spacers were subject to positive selection, which can be used to analyze the population genetic structure of this family. Phylogenetic analysis showed that all Knema species were clustered in the same group and formed a sister clade with Myristica species support by both high maximum likelihood bootstrap values and Bayesian posterior probabilities; among Horsfieldia species, Horsfieldia amygdalina (Wall.) Warb., Horsfieldia kingii (Hook.f.) Warb., Horsfieldia hainanensis Merr. and Horsfieldia tetratepala C.Y.Wu. were grouped together, but H. pandurifolia formed a single group and formed a sister clade with genus Myristica and Knema. Through the phylogenetic analysis, we support de Wilde’ view that the H. pandurifolia should be separated from Horsfieldia and placed in the genus Endocomia, namely Endocomia macrocoma subsp. prainii (King) W.J.de Wilde. Conclusion The findings of this study provide a novel genetic resources for future research in Myristicaceae and provide a molecular evidence for the taxonomic classification of Myristicaceae.


Results
In this study, the chloroplast genomes of two Knema species, Knema globularia (Lam.) Warb. and Knema cinerea (Poir.) Warb., were characterized. Comparing the genome structure of these two species with those of other eight published species, including three Horsfieldia species, four Knema species, and one Myristica species, it was found that the chloroplast genomes of these species were relatively conserved, retaining the same gene order. Through sequence divergence analysis, there were 11 genes and 18 intergenic spacers were subject to positive selection, which can be used to analyze the population genetic structure of this family. Phylogenetic analysis showed that all Knema species were clustered in the same group and formed a sister clade with Myristica species support by both high maximum likelihood bootstrap values and Bayesian posterior probabilities; among Horsfieldia species, Horsfieldia amygdalina (Wall.) Warb., Horsfieldia kingii (Hook.f.) Warb., Horsfieldia hainanensis Merr. and Horsfieldia tetratepala C.Y.Wu. were grouped together, but H. pandurifolia formed a single group and formed a sister clade with genus Myristica and Knema. Through the phylogenetic analysis, we support de Wilde' view that the H. pandurifolia should be separated from Horsfieldia and placed in the genus Endocomia, namely Endocomia macrocoma subsp. prainii (King) W.J.de Wilde.

Conclusion
The findings of this study provide a novel genetic resources for future research in Myristicaceae and provide a molecular evidence for the taxonomic classification of Myristicaceae.

Introduction
The family Myristicaceae includes about 20 genera and 500 species of evergreen trees distributed from tropical Asia and Pacific islands to Africa and tropical America; there are 3 genera and 10 species of Myristicaceae in China [1]. The seeds of some Myristicaceae plants contain about 57.39% solid oil and are mainly composed of Myristic acid and tetradecenoic acid, such that the relative average content of these two acids exceeds 90% [2,3]. Since their seeds are rich in C14 fatty acid [4], which can be used as a condensing agent in medical, beauty, cosmetics, and other industrial products, they are considered a high-quality raw material [3][4][5][6], implying a significant economic opportunity in exploring the resource of Myristaceae species. Because of the use of their oil and wood in industry, Myristicaceae plants have attracted increasing attention from researchers. The studies conducted on this family have mainly focused on their fatty acid content [2], chemical composition [7], and taxonomy [8][9][10][11][12]. We conducted a field survey for this family in Yunnan province, China and found that some species appear in small population with few individuals, such as only 1 Horsfieldia pandurifolia Hu (location: 994 6.956 0 E, 23˚12.521 0 N, altitude of 1 000m) individual has been found in the North of Xiaohei River valley, Shuangjiang county, Yunnan. The original large population of H. pandurifolia has been destroyed and only small isolated population exist in the upper of the distribution area [13]. so, it is important to conserve the species using the effective measure. However, the systematic position of some species within the family remains controversial, such as H. pandurifolia. Therefore, the precise identification and delimitation of the species was the key.
H. pandurifolia was first named by Hu in 1963 [14], and recorded as H. pandurifolia Hu in Flora Yunnanica [15], but the H. pandurifolia was incorporated into the Horsfieldia Prainii (King) Warb., and classified into Horsfieldia macrocoma (Miq) Warb. as a subspecies, and established genus Endocomia as a model species by de Wilde in 1984, namely Endocomia macrocoma ssp. prainii [9]. In 2004, Ye argued that the taxonomic boundaries of the genera Horsfieldia and Endocomia was not obvious, annulled Endocomia, restoring H. macrocoma [12], and in Flora of China (2008), the genus Endocomia was also annulled and H. pandurifolia was named as H. prainii [1]. In recent years, based on results of molecular biology approaches, the systematic position of H. pandurifolia was also differently suggested by Wu [8] and Cai [16]. it is argued about H. pandurifolia position, so it is necessary to study the relationship of Myristicaceae, especially for H. pandurifolia.
The chloroplast organelle is the most noticeable feature in plants, and its plastome is conserved than mitochondrial and nuclear genomes. Due to the low rate of nucleotide substitution, the chloroplast genome is used frequently in phylogeny studies [17], at the same time, the chloroplast genome can be provide large sequence information, so it serves as good candidates for high resolution DNA barcoding [16]. With the advent of sequencing technology such as Illumina, Nanopore, there are increasing reports of complete chloroplast genome for plant phylogenetic analysis, such as in Acanthoideae [18], Magnoliaceae [19], and Zingiberaceae [20] etc. including the phylogenetic relationships study on the Chinese Horsfieldia based on the chloroplast genome analysis [16]. However, the analysis of phylogenetic relationship for the species of Myristicaceae that distributed in China using chloroplast genome has not been reported.
In this study, we sequenced, assembled, and characterized the chloroplast genome of two species of Myristicaceae, Knema globularia (Lam.) Warb. and Knema cinerea (Poir.) Warb., using the Roche/454 sequencing platform. To analyze the organization, gene contents, patterns of nucleotide substitution, simple sequence repeats (SSRs), and phylogenetic relationships, the previously reported chloroplast genomes of eight Myristicaceae species were considered for comparative analyses with the two newly assembled chloroplast genomes. Our study aims were as follows: (1) evaluate the variations within the 10 species and the structural diversity of the chloroplast genome among Myristicaceae species; (2) upgrade our understanding of the application value of the chloroplast genome of Myristicaceae and provide novel genetic resources for future research in Myristicaceae; and (3) to provide a molecular evidence for the taxonomic classification of Myristicaceae.

Plant material and DNA extraction
The species used in this study were identified according to the records in Flora of China. The samples information, including the two newly sequenced species and eight reported species, were listed in the Table 1, and the species names in this table were derived from the Flora Reipublicae Popularis Sinicae and Flora of China. The collection sites included Nangunhe National Nature Reserve, Xishuangbanna National Nature Reserve and Yunnan Institute of Tropical Crops (YITC), Yunnan, China. All the live specimen was planting on site, so, the herbarium were not been made. Fresh leaves of Myristicaceae plants were sampled and immediately dried with silica gel, and total genomic DNA per germplasm was extracted from 100 mg dried leaves using the DNeasy Plant Mini Kit (QIAGEN, Valencia, CA, USA).

DNA sequencing
DNA concentrations were quantified using Life Invitrogen Qubit1 3.0 (Life, Invitrogen, USA). The total genomic DNA per germplasm with over 30 ng μL -1 were used for subsequent steps. Purified DNA (5 mg) was sheared by nebulization with compressed nitrogen gas, which yielded DNA fragments of 400-800 bp in length, with single strands of DNA being recovered by denaturing treatment after the modification of terminal repair and specific adaptor sequence connections. Specific proportions of single-stranded DNA libraries were immobilized on DNA capture magnetic beads for Emulsion PCR and then sequenced in GS FLX reagents. After reaction, 4-6 billion base information sites were obtained. Table 1. The information of ten samples used in this analysis.

Sample
No.

GPS coordinates
Site of specimen preservation Accession number in GenBank

Genome assembly and annotation
The chloroplast genome was assembled using CLC Genomic Workbench v3.6 (http://www. clcbio.com), the contigs were aligned (�90% similarity and query coverage) and ordered, according to the reference chloroplast genome of K.elegans which was reported by author. The genes in the chloroplast genome were predicted using the Dual Organellar GenoMe Annotator (DOGMA) program [21]. The protein-encoding genes, location of the ribosomal RNA (rRNA) and transfer RNA (tRNA) genes were annotated using BLASTX and BLASTN searches. To accurately determine the boundaries between introns and exons and the starting and stopping codon positions of protein-coding genes, the annotated results were manually examined, and the codon position was adjusted by comparing them with homologous genes in the reference genomes, namely H. pandurifolia, H. kingii and K. elegans, based on phylogenetic proximity. The chloroplast genome map was drawn using Genome Vx software [22], and the chloroplast genome sequences of K.globularia and K.cinerea have been deposited to the Gen-Bank, the accession numbers were listed in Table 1.

Genome comparison
Performing comparative analysis for the 10 Myristaceae species, the chloroplast genome of Liriodendron tulipifera L. from the same order Magnoliales was downloaded from GenBank (NC_008326.1, https://ncbi.nlm.nih.gov/search/all/?term=NC_008326.1) and used as a reference. The sequence identity of 11 species chloroplast genomes were plotted using the mVISTA program with the LAGAN mode [25]. To detect the rearrangement and inverse evolutionary events, multiple genome alignments were conducted using the progressive mauve algorithm [26], as implemented in the Geneious software package (Biomatters, Auckland, New Zealand), in the Mauve options, change the alignment algorithm to MCM (the Mauve Contig Mover) alignment. The borders of small single-copy (SSC), large single copy (LSC), and inverted repeat (IR) regions among Myristicaceae species were visually displayed and compared using Irscope [27] based on these species annotation GenBank (.gb) files. To detect the hotspots of the intergenetic divergence, 74 coding sequences, and 41 intergenic space sequences were extracted for each species using Phylosuite ver. 1.1.152 [28] and aligned in batches using MAFFT [29], using the-auto = strategy and the codon alignment mode. Following this step, the nucleotide diversity (Pi) value was calculated for each of the 115 loci using DnaSP ver.6.12.03 [30]. To determine whether coding genes in 10 species had selective pressure during evolution, DnaSP software [30] was used to calculated synonymous (dS) and nonsynonymous (dN) values of coding genes of 10 species as well.

Phylogenetic analysis
To obtain a more comprehensive results, 16 previously reported chloroplast genomes of Myristicaceae by Cai et al. [16] were also downloaded from GenBank (MN486685, MN486686, and MN495958-MN495971) and performed the phylogenetic analyses together with these 10 sequences in this study (the accession numbers are listed in Table 1), and using L.tulipifera as outgroup. After the alignment of all sequences with MAFFT software, the phylogenetic analyses were performed using maximum likelihood (ML) and Bayesian inference (BI) with MEGA 7.0 [31] and MrBayes version 3.2.6 [32] respectively. Conditions were set as bellow: using the model of GTR+Γ for ML, and GTR+Γ+I for BI analysis, the node support values were given in the form of posterior probability (PP) and bootstrap value (BV), when perform the BI analysis, the Markov chain Monte Carlo (MCMC) was run for 2, 000, 000 generations with two parallel searches using four chains.  Table 2). The base composition of the 10 chloroplast genome sequences were analyzed and counted, and the GC content ranged from 39.19% (M.

Repeat analysis
Tandem repeat sequences (TRSs) have an important influence in terms of gene structure, function and evolution, and so on [33]. SSRs are tandemly repeated motifs with a length of 1-6 bp, which have been widely used as molecular markers in evolutionary biology and population genetics [34,35]. To explore the genetic changes evident in the Myristicaceae species analyzed, we performed tandem repeat and SSRs analysis. In this study, we identified an average of 62 SSR loci in the complete chloroplast genome of the studied species, including 45 mononucleotide SSR loci, 5 dinucleotide SSR loci, 2 trinucleotide SSR loci, 8 tetranucletide SSR loci, and 2 pentanucleotide SSR loci. In all species, only one hexanucleotide was found in H. pandurifolia, (AATAAA)3, located in the matK~rps16 region (Fig 2A). SSR information, from trinucleotides to hexanucleotides, are displayed in Table 4, because of the large number of mononucleotide and dinucleotides SSR sites, they are listed in S1 Table. At the same time, a total of 38 types of TRSs were detected in all species, with the repeat type ranging from 16 to 24 per species ( Fig 2B); these repeats are mainly distributed in the ycf2, trnV-GAC~rps7, and trnN-GUU~trnR-ACG in the IR regions, the rps11, ndhB, petN~psbM, atpH~atpI, rpoB~trnC-GCA, atpB~rbcL, trnP-UGG~psaJ, psbZ~trnG-GCC, rpl20~rps12, rpl32~trnL-UAG, trnC-GCA~petN, trnD-GUC~trnY-GUA, and trnT-UGU~trnF-GAA in the LSC, and the ycf1, ycf1~trnN-GUU, ndhD~ccsA, ccsA~ndhF, and rpl32~trnL-UAG in the SSC. Among them, the shortest TRSs with a base sequence of TTTATATAA were detected in the species K. cinerea, K. elegans, K. furfuracea, and K. linifolia, while the longest TRS, with a base sequence of AGAAAAATGGAGACTATTTCTTTTTATTTAT was detected only in K. linifolia (Fig 3).

Genome comparison
To investigate the variation in chloroplast genome sequences, the 10 Myristicaceae species were compared with the mVISTA using the annotation sequence of L. tulipifera. The results indicated that the chloroplast genome sequences of Myristicaceae were relatively conserved, although a certain level of variation was detected. The pair of reverted repeat region was highly conserved than the LSC region and SSC region, and the PCGs were highly conserved than
Previous studies have found that the plastome sequence was conserved in flowering plants [36], although, due to evolutionary events, changes occurred in the size and boundaries of individual replicates and reverse repeats [37,38]. Comparison among the reverse repeat region, the LSC region, and the SSC region boundaries in the 10 species of Myristicaceae are presented in Fig 7. Most species exhibit some variation in the number of nucleotides in the boundaries of the LSC, IR, and SSC regions. Except H. pandurifolia, the studied Myristicaceae species had the same set of genes at the border: the rpl22 and trnH genes were located in the LSC region, the ndhF and trnR genes were located in the SSC region, and the rps19 and rrn5 genes were found in the IR regions. However, for H. pandurifolia, the trnH gene was located in the IR regions, the rps19 gene was located on the LSC/IRs border, and the positioning of the ycf1 gene in the IRb/SSC border was observed only in the genome of H. pandurifolia.

Phylogenetic analysis
To determine the phylogenetic relationship of the 26 samples of Myristicaceae, phylogenetic trees were reconstructed using the chloroplast genome of those species using BI and ML, taking L. tulipifera as the outgroup. The results showed that BI and ML analyses were congruent with high-support PP, 1.0, and MP, 100, in most relationships (Fig 8).

Discussion
Myristicaceae plants are recognized as source plants of medical, wood, and oil products. M. yunnanensis is listed as a protected plant in China and is now considered endangered [39]. Until recently, the reports of genome sequences in this family had only been published by the authors [40][41][42][43][44][45][46][47] and Cai [16]. In this study, we comprehensively analyzed the chloroplast genome of Myristicaceae species and reconstructed phylogenetic tree using 26 genome sequences. The results showed that all genomes exhibited a quadripartite structure, including LSC, small SSC, and a pair of IR regions (IRa and IRb), which consisted of 84-92 PCGs, 26-31 tRNAs, and 8 rRNAs, with genome sizes that ranged from 154,527 to 155,923 bp and with GC % contents that ranged from 39.18% to 39.24%. Compared to species from the same order Magnoliales, the organization and structure of these 10 chloroplast genomes were similar to that of the L. tulipifera [48].

PLOS ONE
Comparative chloroplast genomics and phylogenetic relationships of Myristicaceae Genome repeats play an indispensable role in gene expression, transcriptional regulation, chromosome construction, genomic structural variation, expansion, and rearrangement [49][50][51]. Similarly, we analyzed the repeat sequences of Myristicaceae, and the results are consistent with previous studies that most SSRs were located in intergenic spacers regions, followed by coding regions, and most TRSs were located in ycf genes and non-coding regions [19,[52][53][54]. The cpSSR has been widely used in phylogenetic evaluation and population genetics [55], and it is a very effective marker as well [56,57]. In the chloroplast genome of Myristicaceae, mononucleotide repeats made up over 79%, and over 95% of the mononucleotide consisted of A or T bases, and the majority of SSRs were found in the SSC and LSC regions, the proportion of mononucleotide repeats and the base composition were similar to previous research [53,55].
The variation in the chloroplast genome size was a result of the contraction and expansion of the reverse repeats (IRs) [58]. This contraction and expansion were observed in the chloroplast genome sequences of Myristicaceae. The size of the IRs ranged from 18,877 bp (H. amygdalina) to 24,077 bp (K. globularia). Despite the similar lengths of the IR regions of H. pandurifolia in relation to other Myristicaceae species, with the exception of H. amygdalina, some level of expansion and contraction were observed. Due to the different positions of the genes rps19, ycf1, and trnH, three types of variation in the border of IR-SC region appeared among these species, as a result of the contraction and expansion of reverse repeats. Type I occurred in H. pandurifolia, where parts of the genes rps19 and ycf1 were located in the IR region, and other parts were located in the LSC and SSC regions, respectively. Type II occurred in all other species, except H. pandurifolia, that the two rps19 were located in the IR regions. Type III was found in all the studied species, except H. pandurifolia, that trnH was located in the LSC region whereas trnH was located in IR regions of H. pandurifolia. All genomes had ndhF in the SSC region, rpl2 in the IR region, and rpl22 in the LSC region. dN, dS, and dN/dS were calculated to evaluate sequence diversity and purifying selection in species evolution. The results indicated low sequence diversity in most genes. dN/dS analysis showed that most protein coding genes faced negative selection (dN/dS < 1), only 11 genes (accD, ccsA, matK, ndhF, ndhG, psaA, rbcL, rpoA, rpoC2, ycf1, and ycf2) were positively selected, and the genes with positively selected in this study have been reported in Barleria prionitis [18] and Rheum species [59]. Nucleotide diversity can be used to measure mutations in the population and can also be used to estimate evolutionary relationships [60]. In recent years, DNA barcoding has been considered as a reliable tool for resolving phylogenetic relationships and species authentication [53,61]. In this study, we aligned the chloroplast genomes and found 11 and 18 mutation sites in the protein coding region and the intergenic spacer, respectively. These mutation hotspots provided valuable markers for identification of species as well as resolving phylogenetic relationships in the family.
Comparative genome analysis using mVISTA showed that the genomes are relatively conserved with minor variation, which mainly occurred in non-coding regions. Through the results of alignment, there were no considerable rearrangement being detected in the chloroplast genome.
The genome sequences were the effective resource for inferring phylogenetic relationships among species [62]. The phylogenetic relationship among the species of Myristicaceae is controversial, particularly H. pandurifolia [8,12,16]. In this study, the phylogenetic relationships of 26 Myristicaceae samples were inferred using ML and Bayesian methods. The genus Knema was clustered with the genus Myristica, and same-genus species were clustered in the same clade with a high support value, this was consistent with previously reported results which Knema species was found to share sister relationship with Myristica species [63]. In Horsfieldia, H. kingii, H. hainanensis, H. tetratepala and H. amygdalina were clustered together, however, H. pandurifolia was separated from the four Horsfieldia species forming a single clade, same as the result of Amplified Fragment Length Polymorphism (AFLP) [8]. For the morphological observation, the aril's color of H. pandurifolia was recorded as orange in Flora of China, whereas bright red as observed by Wu [8]; the seed apex of other Horsfieldia species were round, whereas H. pandurifolia was pointed, same as E. macrocoma spp. prainii; the testa color of H. pandurifolia, same as E. macrocoma, was variegated, whereas other Horsfieldia species' testa were brown [8,13]. Based on the results of fatty acid research, Wu believed that H. pandurifolia should be separated from genus Horsfieldia [8]. In summary, molecular, morphological, and fatty acid data show that H. pandurifolia should be treated as a genus [8,9].
In current study, the chloroplast genome sequences that used to analysis the phylogenetic relationship included all species except for Myristica cagayanensis Merr. And Myristica simiarum A.DC. distributed in China, which can better explain the phylogenetic relationships of the Chinese species of this family. This study suggests taxonomical revision that H. pandurifolia should be separated from the genus Horsfieldia and placed in the genus Endocomia (1984) proposed by W.J.de Wilde [9]. In addition, morphological observations of genus Knema by the authors showed that the flowers and leaves of K. cinerea and K. elegans are similar. Based on the phylogenetic tree, K. furfuracea and K. linifolia had a relatively close phylogenetic relationship, as the same result occurred in both K. cinerea and K. elegans. We will do further research in the future on the phylogenetic relationship of genus Knema.

Conclusion
In this study, we sequenced two chloroplast genomes of Knema species, K. globularia and K. cinerea, and compared them with those of eight reported species. The size, gene content and the structure of Myristicaceae chloroplast genomes were similar, and comparative analysis revealed no gene inversion or relocation among chloroplast genome. We identified eleven highly variable sites, with seven intergenic spacer region and four PCGs, that could be explored to create valuable genetic markers for species authentication and phylogeny in Myristicaceae.
Through the comparison of the genomes, we identified 11 genes and 18 intergenic spacer regions were positively selected which can be used to analyze the population genetic structure of Myristicaceae. We comprehensively analyzed the phylogenetic relationship of Myristicaceae using 26 samples from 3 genera distributed in China. We suggesting taxonomical revision that H. pandurifolia should be separated from Horsfieldia and placed in the genus Endocomia.
Supporting information S1 Table. Mononucleotide and dinucleotides SSR site information of ten species of Myristicaceae.